Adding 16bpp Pixels

I wrote this article as a response to Rawhed's article in Hugi #16 on the topic. I want to present a better way of adding 16bpp pixels. First I will show two routines, one that does simple adding with saturation and the other which creates an average of pixels. Then I will draw a description of these routines.

Routines

The two routines start in the same way, but end up differently. By right of each instruction I marked a clock cycle in which it is issued. '-' means that an instruction is issued, due to pairing, in the same cycle as the previous one. The cycles correspond to Pentium processor.

The routines process two pairs of pixels in parallel, as described later. Registers esi and edi both contain pointers to pixels. A resulting pair of pixels is put at [edi]. Registers eax, ecx, edx and ebx get destroyed.

                   ; Load pixels and decompose them
                           mov     eax, [esi]      ; 1
                           mov     ecx, [edi]      ; -
                           mov     edx, eax        ; 2, cache hit assumed
                           mov     ebx, ecx        ; -
                           and     eax,  07E0F81Fh ; 3
                           and     ecx,  07E0F81Fh ; -
                           and     edx, 0F81F07E0h ; 4
                           and     ebx, 0F81F07E0h ; -

   ; Sum up the pixels                     ; Calculate average
           add     eax, ecx ; 5                    add     eax, ecx ; 5
           add     edx, ebx ; -                    rcr     eax, 1   ; 6
                                                   add     edx, ebx ; -
   ; Check carries and saturate edx                rcr     edx, 1   ; 7
           sbb     ecx, ecx        ; 6
           xor     ebx, ebx        ; -     ; Compose and store pixels
           and     ecx, 0F8000000h ; 7             or      eax, edx   ; 8
           test    edx,  00200000h ; -             mov     [edi], eax ; 9
           setz    bl              ; 8
           or      edx, ecx        ; 9
           xor     ecx, ecx        ; -
           dec     ebx             ; 10
           test    edx, 00000800h  ; -
           setz    cl              ; 11
           and     ebx, 001F0000h  ; 12
           dec     ecx             ; -
           or      edx, ebx        ; 13
           and     ecx, 000007E0h  ; -
           or      edx, ebx        ; 14

   ; Check carries and saturate eax
           xor     ecx, ecx        ; -
           and     edx, 0F81F07E0h ; 15
           test    eax, 08000000h  ; -
           setz    cl              ; 16
           xor     ebx, ebx        ; 17
           dec     ecx             ; -
           and     ecx, 07E00000h  ; 18
           test    eax, 00010000h  ; -
           setz    bl              ; 19
           or      eax, ecx        ; -
           xor     ecx, ecx        ; 20
           dec     ebx             ; -
           and     ebx, 0000F800h  ; 21
           test    eax, 00000020h  ; -
           setz    cl              ; 22
           or      eax, ebx        ; 23
           dec     ecx             ; -
           and     ecx, 0000001Fh  ; 24
           and     eax, 07E0F81Fh  ; -
           or      eax, ecx        ; 25

   ; Compose and store pixels
           or      eax, edx        ; 26
           mov     [edi], eax      ; 27

In the end, the procedure that adds pixels with saturation (left) takes 27 clocks, and as it processes pixels in parallel, it takes 13.5 cycles per pixel. The other procedure that calculates average of two pixels (right) takes 9 clocks, with 4.5 clock per pixel.

How does it work

Generally, pixels are formed in the following way: 5 bits for red color component, 6 for green and 5 for blue. An example layout:

                            15             0
                            0010110011000110
                            Àred-Àgren-Àblu-

As a pixel takes exactly 16 bits, two pixels fit in a 32-bit register. So if we have two pixels from image A in register eax, and two pixels from image B (of corresponding position) in register ecx, we can simply do 'add eax, ecx' to get the two sums of two pixel pairs. The only problem are carries, which are added to adjacent components. So we just smartly split the components of the pixels, do the addition, then either saturate or shift it right, and compose the pixels back. See the example:

                     31                             0
   eax from image A: 00101100110001100110001111011111
   ecx from image B: 10010011011010000001000001011111
                     Ãred-Àgren-Àblu´Ãred-Àgren-Àblu´
                     À----pixel-2----À----pixel-1----

Code that splits the pixels:

        ; copy the pixels
                mov     edx, eax
                mov     ebx, ecx
        ; mask off even components
                and     eax, 00000111111000001111100000011111b ; 07E0F81Fh
                and     ecx, 07E0F81Fh
        ; mask off odd components
                and     edx, 11111000000111110000011111100000b ; 0F81F07E0h
                and     ebx, 0F81F07E0h

And contents of registers resulting from the splitting:

                     31                             0
                eax: -----100110-----01100------11111
                edx: 00101------00110-----011110-----
                ecx: -----011011-----00010------11111
                ebx: 10010------01000-----000010-----
                     Ãred-Àgren-Àblu´Ãred-Àgren-Àblu´
                     À----pixel-2----À----pixel-1----

'-' indicates 0 put by the splitting routine with 'and'.

The rest is simple. Maybe the carry checking in the saturation code is the only thing that requires explanation. Watch this:

        xor     ebx, ebx        ; ebx = 00000000h
        test    eax, 00200000h  ; zf=1 if bit21 is clear, otherwise zf=0
        setz    bl              ; ebx=1 if bit21=0, otherwise ebx=0
        dec     ebx             ; ebx=0 if bit21=0, otherwise ebx=0FFFFFFFFh
        and     ebx, 001F0000h  ; mask off pix2.red (bit21 was the carry)
        or      eax, ebx        ; set pix2.red to max value if overflowed

This way of carry checking has the advantage of taking 5 clocks with no branch. Intermixed with other code theoretically takes 3.5 clocks. A branch version takes 2 (no carry, predicted) to 10 (carry, mispredicted) with an average of 6 clocks on Pentium and 1 to 18(!) with an average of 9 clocks on Pentium II, thus this no-branch version is better.

I do not have any idea whether the given routines are the best. I only know that on Pentium the adding with saturation takes 13.5 clocks per pixel and the pixel average takes 4.5 clocks per pixel.

Chris Dragan . cdragan@ams.ampr.org